FORTRAN Comparisons
Volume Number: 8
Issue Number: 7
Column Tag: Jörg's Folder
FORTRAN Comparisons:
The sequel.
By Jörg Langowski, MacTutor Regular Contributing Author
After my September column on numerical precision in the two competing
FORTRAN compilers from Absoft and Language Systems, we received an irate comment
from an Absoft user that this comparison was not fair and I hadn’t compared both
products to their maximum advantage. An excerpt (the less irate part of it) follows; the
author wished that his name not be divulged:
Mr. Langowski runs his benchmarks using opt=3 for Language Systems but
only basic optimization for Absoft. This is patently unfair. Absoft has loop unrolling and
subroutine inclusion opts that he ignores. These greatly speed up the benchmarks. He
also penalizes Absoft by including the -e option.
I have run similar comparisons with Absoft armed in double precision mode, so
all arithmetic is comparable in accuracy. For the Whetstones, on an FX I get over 6000
with Absoft, which is better than what he gets with a Quadra 700. With a Quadra 950, I
get over 13,800 MWhets from Absoft, while I get roughly 5600 from Language
Systems.
I should point out that the Absoft manual discusses the problem of extended
arithmetic and comparisons, and notifies you of the calls to arm the FPU. I agree that
Language Systems has better support for things like Apple Events etc., but let's make
sure comparisons are fair. I should also mention that Absoft routinely compiles my
code, at all comparable levels of optimization, faster than does Language Systems.
In fact, to recommend only one of the compilers was admittedly a little too
strong. I may start with the conclusion of this column, that really both Absoft and LS
Fortran compilers are very good products and have their advantages and disadvantages.
We contacted Absoft, who had been suspiciously silent throughout all this, and asked
them to express their views. We received a very constructive letter that gives a lot of
insight into the tradeoffs that compiler makers have to deal with. Here it is:
“Dear Mr. Langowski,
We would like to take this opportunity to comment on your September column in
MacTutor magazine. It was stated in the article that “Absoft has recently announced
MacFortran II version 3.1.” We actually began shipping version 3.1.2 of MacFortran II
in October of 1991. In addition to the enhancements to the user interface, this version
includes a code generator which takes advantage of the new 68040 floating point
instructions and a math library for FORTRAN intrinsic functions which is based on the
Motorola transcendental function library intended to be used with 68040 based
machines. [in fact, I had used 3.1.2 in the tests; 3.1 was a typo. sorry. -JL].
A significant portion of the article discusses the Paranoia program which, when
compiled “with Absoft MacFortran 3.1.2, with and without optimizations, produces a
lot of error messages typical of floating point implementations where roundoff is not
handled correctly.” The method developed in the article to achieve a diagnostic-free
result by turning off optimization and using the -e option to prevent the compiler from
maintaining variables in registers is not the solution we would have chosen or
recommended. The Paranoia program can also be successfully negotiated by simply
setting the rounding precision of the floating point unit to the precision of the
benchmark. This procedure is described on pages 5-13 through 5-15 of the Porting
Code chapter of our manual. It has the advantage of not precluding optimization and allows
the compiler to maintain values in registers while still performing rounding to the
width of the variable. However, this still leaves the question of whether it is valid for a
compiler to maintain values in registers as long as possible. We feel that it is, although
we do recognize that there are circumstances where control over the side effects of this
optimization must be made available to the programmer. In particular, we provide
several options and mechanisms to assist in the development of numerically sensitive
programs on machines where the register file is wider than main storage or where fast,
but not necessarily IEEE conforming instructions are present. The MC68040 provides
single and double precision rounded basic operations whose use can improve
performance at the cost of extended precision intermediates. In addition, the VOLATILE
statement (a VAX extension) allows control over individual variables.
To further illustrate the situation, consider the following program:
C 1
      a = 1.0
      b = 3.0
      c = a/b
      if (c .eq. a/b) print *,'equal'
      end
With or without optimization, the MacFortran II compiler generates a program
that correctly prints the string “equal”. On the other hand, under the same conditions,
the Language Systems FORTRAN compiler produces a program that is silent. Does this
indicate problems with the arithmetic generated by the Language Systems compiler? An
inspection of the generated code clearly shows no errors. The Language Systems
compiler exhibits seemingly anomalous behavior precisely because it does not maintain
the variable “c” in a register; its precision is truncated to 32 bits when it is stored
in memory, but the comparison with the reloaded variable is made against the full 96
bit result of the division. The Language Systems “-ansi” switch will generate code
which will compare successfully, but I am certain they would not recommend
indiscriminate use of the option. What this example (and Paranoia) does point out are
the problems that a programmer might encounter on a machine where the width of the
register file is greater than the main storage [my underline - JL]. The Microsoft
compiler for Intel based computers will also fail on this example if optimization is
turned off (Intel floating point units are 80 bits wide).
The section of the article which describes the results of the speed tests begins
with the cautionary remark “we should therefore use the Absoft compiler at least with
the -e option, and maybe also drop the optimizations.” We urge you to reconsider your
conclusions as there is a large body of problems that is not sensitive to environments
where greater precision is maintained in registers than in memory. To relegate these
programs to slower than optimal performance achieves no useful end. Numerically
sensitive programs that explore the boundaries of precision are often better served by
setting the floating point unit to the rounding state they expect.
We noticed that several recommended options, in particular subroutine folding and loop
unrolling, were not used when running the benchmarks. Although we have not
used the Whetstone benchmark for comparison with our competitors on the Macintosh
for over a year, it can dramatically demonstrate some of the performance benefits of
certain optimization techniques. The real advantage of subroutine inlining or folding is
typically not the elimination of the call-return sequence, but rather the opportunities
for further optimizations that it exposes to the compiler. This is in fact what happens
when the P3 subroutine is folded in the Whetstone benchmark. The compiler is able to
determine that the loop is completely invariant and can set the result values without
performing a single loop iteration. As small encapsulated functions dictated by modern
programming paradigms become more commonplace, this optimization technique will
yield even greater performance improvements.
When the -O option (basic optimizations) is used, innermost loops which consist
of a single executable statement are automatically unrolled as is indicated on page 4-16
of our manual. This is the case in the saxpy subroutine in the Linpack benchmark. As
instruction and data cache sizes become larger, multiple execution units (super-scalar
processors) are introduced, and register files are expanded, loop unrolling becomes a
very powerful optimization. It allows a compiler to maintain more values in registers,
schedule code for the various execution units, and group data loads and stores in an
attempt to minimize memory traffic.
When comparing the capabilities of two different compilers on the same piece of
hardware, we feel that they each should be shown off to their greatest advantage.
Sincerely,
Peter A. Jacobson”
Thank you very much for taking the time to reply. It was an oversight that I had
not looked into changing the precision of the FPU for running Paranoia. When you set
the FPU to single precision, the generated code does in fact pass all the numerical
accuracy tests (for an example of how to do this, see the listing).
How to handle the situation where the register precision of the FPU differs from
that of the variables in memory is probably a question of philosophy. It is true that
if you keep intermediate values in registers during subexpression evaluation, you may
get results that differ slightly from those of the same Fortran statements when the
intermediate results are stored in memory. The speed advantage of using the internal
registers to their full extent therefore comes with the need to control the FPU
precision yourself in numerically sensitive parts of your code: if such a piece of code is
written using 32-bit single precision variables, you should set the FPU to single
precision rounding, and reset it to its original state after you're done with that
particular routine. I suppose that is not too much work when you are developing,
e.g., a fast math library.
But let’s look at the optimization issue, which is where the two compilers really
differ. In order to gain a fair comparison on the general-purpose quality of the code
produced by both compilers, I chose those optimization levels on both systems that gave
the best results on the Linpack benchmark (and incidentally also on a Monte-Carlo and
a Brownian Dynamics simulation ‘real-world’ problem that I am currently dealing
with). Those parameters were for LS Fortran: opt=3 and for Absoft: basic
optimizations, no subroutine folding, loop unrolling level 2 or none at all (no big